1. Introduction

HaploQA is a web application developed by Kieth Sheppard for The Reinholdt’s Lab in 2015 for the purpose of storing, increasing accessibility and quality controlling of micro array mice genotyping data. According to the HaploQA about website, ‘On the import of a sample, the haplotype probabilities that are represented in the plots are calculated.’ and the haplotype reconstruction process for the contributing founder strains is then done based on the calculated probabilities. The source codes of HaploQA was written in Python 3 and JavaScript.

R/qtl2 is an open-source software package built in 2019, which not only performs similar haplotype reconstruction tasks as HaploQA but also has the functionality to perform other computations such as genetic mapping - in the pipeline developed for this project,the software takes genotyping data as input, calculate the haplotype probabilities based on a Hidden Markov Model algorithm, then perform haplotype reconstructions based on the quantitative results.

Since R/qtl2 was built more recently and is currently maintained by the developers, the Jackson Laboratory is considering transitioning from HaploQA to R/qtl2 generalized AIL (Advanced intercross lines) for similar haplotype reconstructing tasks with genotyping data. The main purpose of this project is to decide if the transition should be made determine based on whether HaploQA or R/qtl2 genail provides more accurate haplotype reconstruction results.

The best way to determine which method provides more accurate results is to compare the results of the two methods each to the ‘correct’ results - in R/qtl2, there exists some cross-specific models which the Hidden Markov model was tailored to generate the most accurate results for a particular cross. These models will be referred to as the ‘truth’ models in this document.

In this project, we selected some GigaMUGA crosses, developed pipelines that retrieve genotyping data used in the initial HaploQA algorithm, performs haplotype reconstruction computations for both qtl2 genail and haploqa, and compare the percentages of markers within each method that agree with the optimal model of the according crosses to determine whether HaploQA or R/qtl2 gen-AIL would be the most appropriate for similar haplotype reconstruction tasks. The analysis was also done for MiniMUGA, as it is the standard cross type JAX uses for mutant mice, however, there is no tailored truth model for MiniMUGA, so the amount of marker differences between HaploQA and qtl2 was used as the main metrics for MiniMUGA to determine the model performances.

2. Methods

2.1. Environment set-up

The main repository is https://github.com/TheJacksonLaboratory/HaploQA_qtl2_comp which contains the main pipelines developed for this project, all utility functions, and visualizations of selected results.

There is no need to pre-download anything from any other websites or manually create any directories, as those processes were automated by the pipelines. However, the automation depend heavily on the CSV files stored in the directory, and without proper set-ups of the CSV files, the pipeline will not run. Please make sure the following CSVs are present in the directory after cloning the repo:

  1. annotations_config.csv - configuration file containing all technical set-ups for the pipeline, details will be explained in the next section.

  2. founder_lookup_table.csv - lookup table containing all possible geno codes (AA, BB, AB, AY, BY, etc.) and their according number codes (1, 2, 3, 4, etc.)

  3. founder_strains_table.csv - lookup table containing the standard 8 founder strains (A/J, C57BL/6J, etc.) and their according letter codes (A, B, etc.) This is primarily used for CC and DO to ensure the 8 founders are encoded in the correct order.

The pipelines developed for this project is stored in the script, ‘haplotype_reconstruction_pipelines.R’, which is stored in the same working directory as this report and is sourced below:

library(rstudioapi)
library(qtl2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ readr   2.1.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ readr::read_csv() masks qtl2::read_csv()
library(httr)
library(rvest)
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(qtl2convert)
library(ggrepel)
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(stats)
library(ggplot2)
library(stringr)
library(stringi)
library(lsa)
## Loading required package: SnowballC
library(readr)

root <- dirname(getSourceEditorContext()$path)
#source(paste0(root,"/input_data_prep_functions.R"))

source(paste0(root,"/haplotype_reconstruction_pipelines.R"))
results_dir <- file.path(root, 'results')
shiny_pct_fp <- file.path(results_dir, 'shiny_pct_csvs') 

# simulated phenotypes
list_pheno <- c('WBC', 'NEUT')

There are three major pipelines in the repository:

  1. ‘haplotype_reconstruction_pipeline’ - haplotype reconstructions for an entire cross (all individuals within the cross)

  2. ‘sample_haplotype_reconstruction’ - haplotype reconstructions for one individual

  3. ‘truth_model_reconstruction’ - haplotype reconstructions for optimal/truth models

Some crosses, such as DO and CC, have the same contributing founders across all individuals. The haplotype reconstructions can therefore be conducted for all individuals in the entire cross at the same time, and the pipeline ‘haplotype_reconstruction_pipeline’ can be used for crosses with consistent contributing founders for such purpose. Whereas the mutant crosses, such as MiniMUGA, contains individuals with mutate mice where all individuals may have different contributing founder strains. Haplotype reconstructions for these types of crosses need to conducted individually for each sample, and therefore, the pipeline ‘sample_haplotype_reconstruction’ should be used.

To run a cross, make sure to modify the configuration file, ‘annotations_config.csv’, that’s stored in the environment. This file allows the pipeline to automatically obtain all relevant information associated with each cross such as the sample URL, input data directory, output data directory, marker annotation file names, etc. The purpose of this configuration file is to avoid having to pass a large list of parameters into each pipeline function.

A preview of the configuration file is shown as below:

config <- fread(paste0(root, '/annotations_config.csv'))
print(config)
##    array_type        annot_file          data_dir              qtl2_dir
## 1:         CC   gm_uwisc_v2.csv     haploqa_cc_gm        cc_qtl2_genail
## 2:   MiniMUGA mini_uwisc_v2.csv  haploqa_minimuga  minimuga_qtl2_genail
## 3:         DO   gm_uwisc_v2.csv        haploqa_do        do_qtl2_genail
## 4:        BXD   gm_uwisc_v2.csv    haploqa_bxd_f3    bxd_f3_qtl2_genail
## 5:         F2   gm_uwisc_v2.csv        haploqa_f2        f2_qtl2_genail
## 6:  MURGIGV01   gm_uwisc_v2.csv haploqa_gm_mutant gm_mutant_qtl2_genail
##    marker_type
## 1:    GigaMUGA
## 2:    MiniMUGA
## 3:    GigaMUGA
## 4:    GigaMUGA
## 5:    GigaMUGA
## 6:    GigaMUGA
##                                                                                        url
## 1:                               http://haploqa-dev.jax.org/tag/Collaborative%20Cross.html
## 2:                                            http://haploqa-dev.jax.org/tag/MiniMUGA.html
## 3:                                                http://haploqa-dev.jax.org/tag/J:DO.html
## 4:                                http://haploqa-dev.jax.org/tag/BXD%20F3%20re-upload.html
## 5:                                                  http://haploqa-dev.jax.org/tag/f2.html
## 6: http://haploqa-dev.jax.org/tag/Jackson_Lab_Mahaffey_MURGIGV01_20160102_FinalReport.html
##                                                                    founders_list
## 1: 129S1/SvImJ, A/J, C57BL/6J, CAST/EiJ, NOD/ShiLtJ, NZO/HlLtJ, PWK/PhJ, WSB/EiJ
## 2:                                                                              
## 3: 129S1/SvImJ, A/J, C57BL/6J, CAST/EiJ, NOD/ShiLtJ, NZO/HlLtJ, PWK/PhJ, WSB/EiJ
## 4:                                                              C57BL/6J, DBA/2J
## 5:                                                         129S1/SvImJ, C57BL/6J
## 6:                                                                              
##    ngen
## 1:    4
## 2:    4
## 3:    4
## 4:    3
## 5:    3
## 6:    4
##                                                                                            exclude_samples
## 1:                                                                                 AF8, AF9, AFA, AFB, AFD
## 2: DJN, DK5, DK9, DKA, DKK, DM6, DMD, DMK, DMQ, DN2, DNQ, DPG, DPS, DPX, DQ5, DRD, DS5, DSD, DW6, DWW, DWY
## 3:                                                                                                        
## 4:                                                                                                        
## 5:                                                                                                        
## 6:                                                                                                        
##                                                                       founder_url
## 1: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## 2:                         http://haploqa-dev.jax.org/tag/MiniMUGA_Reference.html
## 3: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## 4: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## 5: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## 6: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
##    truth_model
## 1:      risib8
## 2:            
## 3:          do
## 4:         ail
## 5:          f2
## 6:

If the target cross exists in the config file, the pipeline will:

  1. get the filename entered in the ‘annot_file’ column as the annotation file from ‘https://raw.githubusercontent.com/kbroman/MUGAarrays/main/UWisc/

  2. get the input data files for all individuals in the cross from the HaploQA url entered in the ‘url’ column

  3. create a folder in the working directory (if not already exists) with the name entered in ‘data_dir’ column and outputs the scraped input files to this directory

  4. create a folder in the working directory (if not already exists) with the name entered in ‘qtl2_dir’ column to output all the files for qtl2 input

  5. get the founder data from the url given in ‘founder_url’ column

  6. run qtl2 with the given value in the ‘ngen’ column

  7. identify the appropriate crosstype to get the optimal model for the given cross using ‘truth_model’

  8. identify the founders present in the cross. This should be left blank for the mutant crosses or any other crosses where the number of founders ia not consistent between individuals

  9. use ‘marker_type’ to identify the conditions of certain if loops in the pipeline. Should be ‘GigaMUGA’ or ‘MiniMUGA’ for the current pipelines

If the cross that the user is running haplotype reconstructions for does not exist in the config file, add a new row and fill each column accordingly.

Below is an example of how the pipeline set-up looks like for CC:

# cross type
sample_type <- 'CC'
### Environment
config_sample <- config[config$array_type == sample_type] # get config info for this cross type
marker_type <- config_sample$marker_type # type of marker (GigaMUGA/MiniMUGA)
founders_list <- unlist(strsplit(config_sample$founders_list, ", ")) # contributing founders
n_founders <- length(founders_list) # number of founders used in reconstruction
ngen <- config_sample$ngen # ngen parameter used in cross
sample_url <- config_sample$url # URL to retrieve data from haploqa for this cross
founder_url <- config_sample$founder_url # URL to retrieve founder data for GigaMUGA/MiniMUGA
truth_model <- config_sample$truth_model # cross type of truth model for the cross


# directory to store input data retrieved from haploqa
data_dir <- file.path(root, config_sample$data_dir)
# directory to store input data for qtl2
qtl2_dir <- file.path(root, config_sample$qtl2_dir)

print(config_sample)
##    array_type      annot_file      data_dir       qtl2_dir marker_type
## 1:         CC gm_uwisc_v2.csv haploqa_cc_gm cc_qtl2_genail    GigaMUGA
##                                                          url
## 1: http://haploqa-dev.jax.org/tag/Collaborative%20Cross.html
##                                                                    founders_list
## 1: 129S1/SvImJ, A/J, C57BL/6J, CAST/EiJ, NOD/ShiLtJ, NZO/HlLtJ, PWK/PhJ, WSB/EiJ
##    ngen         exclude_samples
## 1:    4 AF8, AF9, AFA, AFB, AFD
##                                                                       founder_url
## 1: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
##    truth_model
## 1:      risib8

From the above set up, the scripts will look for gm_uwisc_v2.csv in https://raw.githubusercontent.com/kbroman/MUGAarrays/main/UWisc/ as the annotation file. The raw input data files will be scraped from http://haploqa-dev.jax.org/tag/Collaborative%20Cross.html and these files will be stored in a folder, ‘haploqa_cc_gm’, in the same directory as the repo, for further wrangling. The wrangled data that’s ready for qtl2 read_cross function will be stored in a folder ‘cc_qtl2_genail’ also in the same directory as the repo. qtl2 will be use ngen = 4 to run haplotype reconstructions. The optimal model of CC is ‘risib8’ in qtl2.

The below results section will demonstrate how the pipelines are used for each cross and the results generated from them.

2.2. Results

As mentioned in the introduction, in order to decide whether to transition, the accuracy of haploqa and qtl2 each need to be compared to an optimal model. The pipelines in the sources script perform haplotype reconstruction tasks using genotyping data obtained from different types of data sources, specifically, 5 sets of GigaMUGA samples (Collaborative Cross, Diversity Outbred, F3, F2, and a GigaMUGA mutant cross), 70 individuals within the MiniMUGA sample. Decision was made to not include Muga/MegaMUGA data due to the discontinued usage of the array types.

2.2.1. GigaMUGA

The pipeline for GigaMUGA executes all the individuals within the sample. The contributing founders for all individuals within the GigaMUGA samples are always the same.

The pipeline to run haplotype reconstructions for an entire cross can be initiated using the function ‘haplotype_reconstruction_pipeline’. Please set toggle ‘samples_gen’ = True if the genotyping data for individuals need to be (re)downloaded from HaploQA, and set to False otherwise. please set toggle ‘qtl2_file_gen’ = F if the input files for qtl2 read_cross need to be (re)generated, and False otherwise.

The function automatically saves the computed haplotype reconstruction results to a number of RDS files, which you can find in the /results/RDS/ directory of this repo. Whenever the ‘qtl2_file_gen’ toggle is set to True to regenerate qtl2 input files for a cross, the according result RDS files for that cross need to be manually removed from the /results/RDS/ directory for the haplotype reconstruction results to be properly updated, as the functions that calculate discordance metrics and construct visualizations read the results from these RDS files.

  1. Results for DO
do_results <- haplotype_reconstruction_pipeline('DO', list_pheno, qtl2_file_gen = F, samples_gen = F)

Below are the genotype visualizations for some individuals in DO:

  1. Sample 6UY

Karyotype plot - sample 6UY in DO

  1. Sample 6WN

Karyotype plot - sample 6WN in DO

  1. Results for CC:
cc_results <- haplotype_reconstruction_pipeline('CC', list_pheno, qtl2_file_gen = F, samples_gen = F)
  1. Sample 35V

Karyotype plot - sample 35V in CC

  1. Sample 6MV

Karyotype plot - sample 6MV in CC

  1. Results for BXD F3:
bxd_results <- haplotype_reconstruction_pipeline('BXD', list_pheno, qtl2_file_gen = F, samples_gen = F)
  1. Sample 5PD

Karyotype plot - sample 5PD in F3

  1. Sample 5R2

Karyotype plot - sample 5R2 in F3

  1. Results for F2:
f2_results <- haplotype_reconstruction_pipeline('F2', list_pheno, qtl2_file_gen = F, samples_gen = F)
  1. Sample 8RS

Karyotype plot - sample 8RS in F2

  1. Sample 8T3

Karyotype plot - sample 8T3 in F3

The below two code chunks represent how the pipeline is used for the mutant mice crosses, where the contributing founders vary between crosses. The process involve a certain extent of filtering because some samples within the cross may only have one contributing founder, which does not work for genAIL as the algorithm requires two or more founders to perform haplotype reconstructions, and there is no simple way of identifying how many contributing fouders are present in each individual.

The function below again, saves the RDS files that contain haplotype reconstruction results for all samples in the same RDS directory as above (/results/RDS/), please make sure to remove th existing RDS file if the results need to be regenerated. The haplotype reconstruction results for each individuals can also be found in the directory where the qtl2 input files for read_cross are being stored - which should be the specified ‘qtl2_dir’ value in the config file.

  1. GigaMUGA Mutant cross
gm_res_dir <- file.path(results_dir, 'RDS', 'gm_results.rds')
gm_pct_dir <- file.path(shiny_pct_fp, 'gm_pct_comp.csv')

if (!file.exists(gm_res_dir)) {
  # ### GigaMUGA mutant
  gigamuga_url <- 'http://haploqa-dev.jax.org/tag/Jackson_Lab_Mahaffey_MURGIGV01_20160102_FinalReport.html'
  url_domain <- 'http://haploqa-dev.jax.org/'
  marker_type <- 'GigaMUGA'
  html_table <- read_html(gigamuga_url) %>% html_nodes("a") %>% html_attr("href")
  url_list <- paste0(url_domain, html_table[grepl('/sample', html_table)])
  summary_df <- sample_summary_scrape(read_html(gigamuga_url), url_list, marker_type)
  summary_df$`% No Call` <- as.numeric(gsub("%", "", summary_df$`% No Call`))
  summary_df <- summary_df[summary_df$`% No Call` < 10,]
  
  ind_gm <- summary_df[summary_df$`Haplotype Candidate` == 'False',]$`ID`
  ind_gm <- ind_gm[!ind_gm %in% c('RY', 'S8', 'SG', 'SR', 'Y2', 'YA', 'YJ', 'YT', 'XZ', 'Y9')] # only one founder
  
  gigamuga_results <- list()
  gm_pct_res <- list()
  skipped_inds <- list()
  for (ind in ind_gm) {
    print(ind)
    skip <- FALSE
    tryCatch(sample_gm_res <- sample_haplotype_reconstruction('MURGIGV01', ind, samples_gen = F, qtl2_file_gen = F), error = function(e) {skip <<- TRUE})
    if(skip) {
       skipped_inds <- c(skipped_inds, ind)
       next}
    df <- ind_geno_comp(sample_gm_res, ind, 'MURGIGV01')
  
    gm_pct_res[[ind]] <- df
    gigamuga_results[[ind]] <- sample_gm_res
  }
  print(paste0('Skipped individuals:', skipped_inds))
  gm_pct_res <- rbindlist(gm_pct_res)
  saveRDS(gigamuga_results, gm_res_dir)
} else {
  gigamuga_results <- readRDS(gm_res_dir)
}

if (!file.exists(gm_pct_dir)) {
  write.csv(gm_pct_res, gm_pct_dir, row.names = F)
} else {
  gm_pct_res <- fread(gm_pct_dir)
}

The selected GigaMUGA mutant cross contains some mice with two contributing founders and some genetically diverse DO mice:

  1. Sample ND

Karyotype plot - sample ND in GigaMUGA Mutant

  1. Sample YE

Karyotype plot - sample YE in GigaMUGA Mutant

2.2.2. MiniMUGA
mini_res_dir <- file.path(results_dir, 'RDS', 'MiniMUGA_results.rds')
mini_pct_dir <- file.path(shiny_pct_fp, 'mini_pct_comp.csv')

if (!file.exists(mini_res_dir)) {
  ### MiniMUGA pipeline
  minimuga_url <- 'http://haploqa-dev.jax.org/tag/MiniMUGA.html'
  url_domain <- 'http://haploqa-dev.jax.org/'
  marker_type <- 'MiniMUGA'
  html_table <- read_html(minimuga_url) %>% html_nodes("a") %>% html_attr("href")
  url_list <- paste0(url_domain, html_table[grepl('/sample', html_table)])
  summary_df <- sample_summary_scrape(read_html(minimuga_url), url_list, marker_type)
  ind_mini <- summary_df[summary_df$`Haplotype Candidate` == 'False',]$`ID`
  ind_mini <- ind_mini[ind_mini!='JXX', 'JY6'] # screentime error, skip this one
  ind_mini <- ind_mini[ind_mini!='JY9'] # only 1 contributing strain, AIL incompatible, skip this one as well
  
  minimuga_results <- list()
  mini_pct_res <- list()
  skipped_inds <- list()
  for (ind in ind_mini) {
    print(ind)
    skip <- FALSE
    tryCatch(sample_mini_res <- sample_haplotype_reconstruction('MiniMUGA', ind, samples_gen = F, qtl2_file_gen = F), error = function(e) {skip <<- TRUE})
    if(skip) {
       skipped_inds <- c(skipped_inds, ind)
       next}
    df <- ind_geno_comp(sample_mini_res, ind, 'MiniMUGA')
    
    minimuga_results[[ind]] <- sample_mini_res
    mini_pct_res[[ind]] <- df
  }
  print(paste0('Skipped individuals:', skipped_inds))
  mini_pct_res <- rbindlist(mini_pct_res)
  saveRDS(minimuga_results, mini_res_dir)
} else {
  minimuga_results <- readRDS(mini_res_dir)
}

if (!file.exists(mini_pct_dir)) {
  write.csv(mini_pct_res, mini_pct_dir, row.names = F)
} else {
  mini_pct_res <- fread(mini_pct_dir)
}
  1. Sample JXN

Karyotype plot - sample JXN in MiniMUGA

  1. Sample JYA

Karyotype plot - sample JYA in MiniMUGA

2.2.3. Shiny Visualizations

A dashboard was built in R Shiny to visualize the karyotype plot results of the haplotype reconstructions, the script to the Shiny app is ‘app.R’ in this directory. The below code chunk should be able to launch the Shiny app.

source(paste0(root,"/app.R"))
shinyApp(ui, server)

3. Statistical Results

To summarize the results, for GigaMUGA, we selected 277 samples from DO, 218 samples from CC, 71 samples from F3, 34 samples from F2, and 288 mutant samples. For MiniMUGA, we selected 25 samples. Among which, DO and CC are complex crosses with 8 contributing founders, F3 and F2 are relatively simple two-founder crosses, whereas the number of contributing founders in the mutant crosses samples can vary from 2 to 8. This is to ensure there are sufficient evidences to evaluate the accuracy of haplotype reconstruction results for crosses of all computational sizes and complexity.

There are three parts of the comparison: 1. haploqa results, which can be obtained directly from haploqa.jax.org by downloading the SNP level haplotype reports from each sample. 2. qtl2 genAIL results, which was calculated using the R package ‘qtl2’ with the crosstypes ‘genail’ 3. qtl2 best model, which was calculated using the qtl2package with the original crosstype

3.1. Truth models

At a glance of the visualizations which present a high-level idea of the difference between the haplotype reconstruction results between qtl2 and haploqa, there are some visible differences between the two models - specifically, qtl2 appears to have more crossovers and come minor areas appear to have different founder results. To determine which model is more correct, some ‘truth’ models, which the algorithms have been used (for a long time?), were also constructed for each of the crosses.

As mentioned in the configuration file, the truth model for CC uses crosstype ‘risib8’, the truth model for DO uses crosstype ‘do’, the truth model for BXD F3 uses crosstype ‘ail’, and the truth model for F2 uses crosstype ‘f2’. The executions of pipelines for the truth models can be initiated using the function, ‘truth_model_reconstruction’. Same as the previous haplotype reconstruction functions, please set toggle ‘samples_gen’ = True if the genotyping data for individuals need to be (re)downloaded from HaploQA, and set to False otherwise. please set toggle ‘qtl2_file_gen’ = F if the input files for qtl2 read_cross need to be (re)generated, and False otherwise. Please also remove the existing result RDS files for the according cross (if any) whenever the qtl2 input files were regenerated to make sure the results are updated properly.

do_truth_results <- truth_model_reconstruction('DO', list_pheno, qtl2_file_gen = F, samples_gen = F)
cc_truth_results <- truth_model_reconstruction('CC', list_pheno, qtl2_file_gen = F, samples_gen = F)
bxd_truth_results <- truth_model_reconstruction('BXD', list_pheno, qtl2_file_gen = F, samples_gen = F)
f2_truth_results <- truth_model_reconstruction('F2', list_pheno, qtl2_file_gen = F, samples_gen = F)
3.2. Quantitative comparisons
GigaMUGA samples:

For visualizations, we constructed boxplots that demonstrate the percentage of marker that do not have the same results between qtl2, haploQA and the truth model:

# filepath to the csvs
# DO
do_pct_df <- fread(file.path(shiny_pct_fp, 'do_truth_comp.csv'))
# CC
cc_pct_df <- fread(file.path(shiny_pct_fp, 'cc_truth_comp.csv'))
# BXD f3
bxd_pct_df <- fread(file.path(shiny_pct_fp, 'bxd_truth_comp.csv'))
# f2
f2_pct_df <- fread(file.path(shiny_pct_fp, 'f2_truth_comp.csv'))
# GigaMUGA Mutant
gm_pct_res <- fread(file.path(shiny_pct_fp, 'gm_pct_comp.csv'))
# MiniMUGA
mini_pct_res <- fread(file.path(shiny_pct_fp, 'mini_pct_comp.csv'))
# wrangle csv files for plots
do_qtl2_truth_diff <- do_pct_df$qtl2_pct_diff
cc_qtl2_truth_diff <- cc_pct_df$qtl2_pct_diff
bxd_qtl2_truth_diff <- bxd_pct_df$qtl2_pct_diff
f2_qtl2_truth_diff <- f2_pct_df$qtl2_pct_diff

# align lengths
max_n <- max(nrow(do_pct_df), nrow(cc_pct_df), nrow(bxd_pct_df), nrow(f2_pct_df))

length(do_qtl2_truth_diff) <- max_n 
length(cc_qtl2_truth_diff) <- max_n
length(bxd_qtl2_truth_diff) <- max_n
length(f2_qtl2_truth_diff) <- max_n

df_qtl2_truth_diff <- as.data.frame(cbind('do' = do_qtl2_truth_diff, 'cc' = cc_qtl2_truth_diff, 'f3' = bxd_qtl2_truth_diff, 'f2' = f2_qtl2_truth_diff))
plot_df <- melt(df_qtl2_truth_diff)
## No id variables; using all as measure variables
library(ggplot2)    
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('qtl2 - proportion discordance between optimal and implemented models')
## Warning: Removed 508 rows containing non-finite values (stat_boxplot).

According to the visualization, which shows the percentage of marker discordance between qtl2 genail and the truth models - DO, F3 and F2 are all producing results that are almost exactly the same as the optimal model, with CC having some discrepancies. We believe the reason why CC performs so differently is that the optimal model for CC does not allow any heterozygosity, whereas the genail does, so the genail model showed some residual heterozygosity from the CC mice which the optimal model does not have the ability to show.

do_haploqa_truth_diff <- do_pct_df$haploqa_pct_diff
cc_haploqa_truth_diff <- cc_pct_df$haploqa_pct_diff
bxd_haploqa_truth_diff <- bxd_pct_df$haploqa_pct_diff
f2_haploqa_truth_diff <- f2_pct_df$haploqa_pct_diff

# align lengths
max_n <- max(nrow(do_pct_df), nrow(cc_pct_df), nrow(bxd_pct_df), nrow(f2_pct_df))

length(do_haploqa_truth_diff) <- max_n 
length(cc_haploqa_truth_diff) <- max_n
length(bxd_haploqa_truth_diff) <- max_n
length(f2_haploqa_truth_diff) <- max_n

df_haploqa_truth_diff <- as.data.frame(cbind('do' = do_haploqa_truth_diff, 'cc' = cc_haploqa_truth_diff, 'f3' = bxd_haploqa_truth_diff, 'f2' = f2_haploqa_truth_diff))
plot_df <- melt(df_haploqa_truth_diff)
## No id variables; using all as measure variables
library(ggplot2)    
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('haploqa - proportion discordance between optimal and implemented models')
## Warning: Removed 508 rows containing non-finite values (stat_boxplot).

According to the visualization, which shows the percentage of marker differences between haploqa and the truth models - the haploqa results for f2 and f3 still seem to be similar to the optimal model, but DO and CC seem to be relatively incorrect results.

do_haploqa_truth_diff <- do_pct_df$haploqa_qtl2_pct_diff
cc_haploqa_truth_diff <- cc_pct_df$haploqa_qtl2_pct_diff
bxd_haploqa_truth_diff <- bxd_pct_df$haploqa_qtl2_pct_diff
f2_haploqa_truth_diff <- f2_pct_df$haploqa_qtl2_pct_diff

# align lengths
max_n <- max(nrow(do_pct_df), nrow(cc_pct_df), nrow(bxd_pct_df), nrow(f2_pct_df))

length(do_haploqa_truth_diff) <- max_n 
length(cc_haploqa_truth_diff) <- max_n
length(bxd_haploqa_truth_diff) <- max_n
length(f2_haploqa_truth_diff) <- max_n

df_haploqa_truth_diff <- as.data.frame(cbind('do' = do_haploqa_truth_diff, 'cc' = cc_haploqa_truth_diff, 'f3' = bxd_haploqa_truth_diff, 'f2' = f2_haploqa_truth_diff))
plot_df <- melt(df_haploqa_truth_diff)
## No id variables; using all as measure variables
library(ggplot2)    
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('haploqa vs qtl2 - proportion discordance between implemented models')
## Warning: Removed 508 rows containing non-finite values (stat_boxplot).

For the above three plots, there are two significant outliers at 80% discordance showing for CC, which are samples V6 and VP. Their haplotype reconstruction results appear to be highly heterozygous as CC mice, which explains the high discordance rate. However, it is believed that these two mice samples may have been mistakenly labeled as CC, as the genomes of CC mice are generally homozygous. Further investigation may be needed on the two samples.

Below are the karyotype plots for the two samples:

Karyotype plot - sample VP in CC Karyotype plot - sample V6 in CC

Mutant Mice samples:

The two mutant mice crosses, GigaMUGA Mutant and MiniMUGA, do not have a tailored optimal model. Therefore, plotting the percentage of marker discordance between qtl2 genail and haploQA.

MiniMUGA:

mini_plot <- mini_pct_res %>% group_by(sample_id) %>% summarise(qtl2_pct_diff = sum(haplo_diplotype != qtl2_calls)/n()) %>% as.data.frame()
plot_df <- melt(mini_plot)
## Using sample_id as id variables
plot_df$variable <- 'MiniMUGA'
   
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('MiniMUGA haploqa vs genail - % of marker discordance') + xlab('model')

The MiniMUGA samples are mutant mice samples, meaning they often have founders that contribute to only a few markers. The above boxplot shows a relatively significant high number of markers that are different between haploQA and qtl2 genail, which shows a weakness for qtl2: due to the way HMM models determines genome states based on probabilities - when they are extra founders, or founders that only contribute to a few markers, qtl2 genail tends to be ‘confused’ and bounce states in the haplotype reconstruction process, whereas haploQA is capable of staying in states.

The outlier at approximately 70% is sample JXV, which according to its karyotype on HaploQA, this sample has a significant mount of genetic variations for a mutant mice. This could also affect the way qtl2 decides the geno phases and cause the discordance

Karyotype plot - sample JXV in MiniMUGA

GigaMUGA mutant:

gm_plot <- gm_pct_res %>% group_by(sample_id) %>% summarise(qtl2_pct_diff = sum(haplo_diplotype != qtl2_calls)/n()) %>% as.data.frame()
plot_df <- melt(gm_plot)
## Using sample_id as id variables
plot_df$variable <- 'GigaMUGA Mutant'
   
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('GigaMUGA Mutant haploqa vs genail - % of marker discordance') + xlab('model')

The results between genail and haploQA appear to be highly similar, with an average of less than 2% marker discordance between the models. The outlier at approximately 50% is the sample, ‘PX’. The karyotype plot is show as below:

Karyotype plot - sample PX in GigaMUGA Mutant

4. Conclusion

Overall, the existing results for qtl2 genail and haploqa show that the two methods have very similar performance for most of the crosses, with genail performing significantly better for the DO outbred mice samples. HaploQA handles the extra founders being present in the haplotype reconstruction process better, however, qtl2 genail model has stronger capabilities to produce more accurate results for complicated crosses. So based on the results we have so far, we’re leaning toward that genail is more accurate than haploqa.

Aside from the statistical metrics comparison between the haplotype reconstruction results, it is also worth taking the programmatic strengths and weaknesses of the two models into consideration:

HaploQA:

  1. advantages:

    1. Not a programming tool - people without relevant knowledge can still use the tool and perform haplotype reconstruction

    2. Works for all MUGA array-based genotyping data

    3. Detailed and concise visualizations on web interface

  2. disadvantages:

    1. haplotype reports are not easy to access computationally

    2. source codes are also not easy to access computationally

    3. expensive to maintain at the organization’s costs

R/qtl2:

  1. advantages:

    1. fully accessible computationally

    2. capable of performing other genetic analysis

    3. currently maintained by developers

  2. disadvantages:

    1. algorithms is probability based, tends to bounce states when low-contribution founders are present

    2. visualizations are less detailed and need to be generated on demand

    3. does not scale well with large number of founders, run time increases exponentially

    Genoprob runtime plot

5. References

  1. http://haploqa-dev.jax.org/help.html

  2. Broman KW, Gatti DM, Simecek P, Furlotte NA, Prins P, Sen Ś, Yandell BS, Churchill GA. R/qtl2: Software for Mapping Quantitative Trait Loci with High-Dimensional Data and Multiparent Populations. Genetics. 2019 Feb;211(2):495-502. doi: 10.1534/genetics.118.301595. Epub 2018 Dec 27. PMID: 30591514; PMCID: PMC6366910.

Known issues:

  1. The webpages throws a screentime error while downloding some individuals from haploQA - this is presumably a Mac issue. The functions automatically skip the samples with screentime error issue, however, this may become an issue when a screentime error rises up while getting the founder data because the function would skip that founder, and the user needs to manually download it. Please make sure that all founders are present before proceeding with the pipelines.

  2. Error in dev.off() : QuartzBitmap_Output - unable to open file ’/Users/linc/.local/share/rstudio/notebooks/AFB3D6E3-draft_report/1/8779F85C5B04ECDA/cvdr4ccbjgppi_t/_rs_chunk_plot_010.png’

This is likely an R markdown issue or some unclosed connections in the backend environment after knitting the R markdown - please make sure to clear the workspace or restart R if this happens.